Codes Correcting Two Deletions
In this work, we investigate the problem of constructing codes capable of
correcting two deletions. In particular, we construct a code that requires
approximately 8 log n + O(log log n) bits of redundancy, where n is the length
of the code. To the best of the author's knowledge, this represents the best
known construction in that it requires the fewest redundant bits of any code
correcting two deletions.
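As a rough illustration (not the construction from this work), deletion-correcting codes are commonly built from modular weighted check sums in the spirit of Varshamov-Tenengolts codes; each stored check sum costs on the order of log n redundant bits, which is where bounds of the form c log n + O(log log n) come from. The moduli below are hypothetical choices for illustration only.

# Illustrative sketch only: weighted check sums of the kind used in
# syndrome-based deletion codes. A real two-deletion code pins down the
# moduli (and adds further hashes) so any two deletions can be reversed.

def vt_style_syndromes(x):
    """Compute 0th-, 1st-, and 2nd-order weighted check sums of a binary word."""
    n = len(x)
    s0 = sum(x) % 3                                               # ~log2(3) bits
    s1 = sum(i * xi for i, xi in enumerate(x, 1)) % (2 * n)       # ~log2(2n) bits
    s2 = sum(i * i * xi for i, xi in enumerate(x, 1)) % (2 * n * n)  # ~2 log2(n) bits
    return s0, s1, s2

word = [1, 0, 1, 1, 0, 0, 1, 0]
print(vt_style_syndromes(word))  # redundancy = bits needed to store these sums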
Lifting Weak Supervision To Structured Prediction
Weak supervision (WS) is a rich set of techniques that produce pseudolabels
by aggregating easily obtained but potentially noisy label estimates from a
variety of sources. WS is theoretically well understood for binary
classification, where simple approaches enable consistent estimation of
pseudolabel noise rates. Using this result, it has been shown that downstream
models trained on the pseudolabels have generalization guarantees nearly
identical to those trained on clean labels. While this is exciting, users often
wish to use WS for structured prediction, where the output space consists of
more than a binary or multi-class label set: e.g. rankings, graphs, manifolds,
and more. Do the favorable theoretical properties of WS for binary
classification lift to this setting? We answer this question in the affirmative
for a wide range of scenarios. For labels taking values in a finite metric
space, we introduce techniques new to weak supervision based on
pseudo-Euclidean embeddings and tensor decompositions, providing a
nearly-consistent noise rate estimator. For labels in constant-curvature
Riemannian manifolds, we introduce new invariants that also yield consistent
noise rate estimation. In both cases, when using the resulting pseudolabels in
concert with a flexible downstream model, we obtain generalization guarantees
nearly identical to those for models trained on clean data. Several of our
results, which can be viewed as robustness guarantees in structured prediction
with noisy labels, may be of independent interest. Empirical evaluation
validates our claims and shows the merits of the proposed method.
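As a rough sketch of the pseudo-Euclidean embedding ingredient (the tensor-decomposition noise-rate estimator is not shown), a finite metric space given by a distance matrix D can be embedded by double-centering -D^2/2 and keeping eigenvectors for both positive and negative eigenvalues, so that squared distances decompose as a difference of two Euclidean terms. The toy metric below is purely illustrative.

import numpy as np

def pseudo_euclidean_embedding(D, tol=1e-9):
    # Generalized (pseudo-Euclidean) MDS: double-center the squared distances
    # and keep both positive- and negative-eigenvalue directions.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    pos, neg = eigvals > tol, eigvals < -tol
    X_plus = eigvecs[:, pos] * np.sqrt(eigvals[pos])     # "positive" coordinates
    X_minus = eigvecs[:, neg] * np.sqrt(-eigvals[neg])   # "negative" coordinates
    return X_plus, X_minus   # d(i,j)^2 = ||xi+ - xj+||^2 - ||xi- - xj-||^2

# toy metric on 4 labels
D = np.array([[0, 1, 1, 2],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [2, 1, 1, 0]], dtype=float)
Xp, Xm = pseudo_euclidean_embedding(D)
i, j = 0, 3
approx = np.sum((Xp[i] - Xp[j]) ** 2) - np.sum((Xm[i] - Xm[j]) ** 2)
print(approx, D[i, j] ** 2)  # the two values should roughly agree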
Resonant Anomaly Detection with Multiple Reference Datasets
An important class of techniques for resonant anomaly detection in high
energy physics builds models that can distinguish between reference and target
datasets, where only the latter has appreciable signal. Such techniques,
including Classification Without Labels (CWoLa) and Simulation Assisted
Likelihood-free Anomaly Detection (SALAD), rely on a single reference dataset.
They cannot take advantage of commonly-available multiple datasets and thus
cannot fully exploit available information. In this work, we propose
generalizations of CWoLa and SALAD for settings where multiple reference
datasets are available, building on weak supervision techniques. We demonstrate
improved performance in a number of settings with realistic and synthetic data.
As an added benefit, our generalizations enable us to provide finite-sample
guarantees, improving on existing asymptotic analyses.
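The following is a minimal CWoLa-style sketch in which several reference (background-only) samples are simply pooled against the target sample and the classifier score is used as an anomaly score; the generalizations proposed in this work are more careful than this, so treat the pooling step and the synthetic data as illustrative assumptions.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# synthetic features: two background-like reference samples and a target
# sample containing a small admixture of "signal" events
ref1 = rng.normal(0.0, 1.0, size=(2000, 2))
ref2 = rng.normal(0.1, 1.1, size=(2000, 2))      # slightly mis-modeled reference
background = rng.normal(0.0, 1.0, size=(1800, 2))
signal = rng.normal(2.0, 0.5, size=(200, 2))
target = np.vstack([background, signal])

# label references 0 and target 1, then train a classifier to tell them apart
X = np.vstack([ref1, ref2, target])
y = np.concatenate([np.zeros(len(ref1) + len(ref2)), np.ones(len(target))])
clf = GradientBoostingClassifier().fit(X, y)

# events the classifier pushes toward "target" are anomaly candidates
scores = clf.predict_proba(target)[:, 1]
print("mean score, background vs signal:",
      scores[:len(background)].mean(), scores[len(background):].mean())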
Zero-Shot Robustification of Zero-Shot Models With Foundation Models
Zero-shot inference is a powerful paradigm that enables the use of large
pretrained models for downstream classification tasks without further training.
However, these models are vulnerable to inherited biases that can impact their
performance. The traditional solution is fine-tuning, but this undermines the
key advantage of pretrained models, which is their ability to be used
out-of-the-box. We propose RoboShot, a method that improves the robustness of
pretrained model embeddings in a fully zero-shot fashion. First, we use
zero-shot language models (LMs) to obtain useful insights from task
descriptions. These insights are embedded and used to remove harmful and boost
useful components in embeddings -- without any supervision. Theoretically, we
provide a simple and tractable model for biases in zero-shot embeddings and
give a result characterizing under what conditions our approach can boost
performance. Empirically, we evaluate RoboShot on nine image and NLP
classification tasks and show an average improvement of 15.98% over several
zero-shot baselines. Additionally, we demonstrate that RoboShot is compatible
with a variety of pretrained models and language models.
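A minimal sketch of the embedding-adjustment idea: assuming we already have text embeddings for harmful (spurious) and useful concepts obtained from a language model, harmful components can be projected out of an input embedding and useful ones accentuated. The function and vectors below are illustrative placeholders rather than RoboShot's exact procedure.

import numpy as np

def adjust_embedding(z, harmful_dirs, useful_dirs, boost=1.0):
    z = z.copy()
    # remove components along harmful directions (rejection onto each direction)
    for u in harmful_dirs:
        u = u / np.linalg.norm(u)
        z = z - (z @ u) * u
    # accentuate components along useful directions
    for v in useful_dirs:
        v = v / np.linalg.norm(v)
        z = z + boost * (z @ v) * v
    return z / np.linalg.norm(z)

d = 512
z = np.random.randn(d)            # e.g. an image embedding from a pretrained model
spurious = [np.random.randn(d)]   # e.g. text embedding of a spurious background concept
useful = [np.random.randn(d)]     # e.g. text embedding of the task-relevant concept
z_robust = adjust_embedding(z, spurious, useful)
print(z_robust.shape, np.linalg.norm(z_robust))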
Good Data from Bad Models: Foundations of Threshold-based Auto-labeling
Creating large-scale high-quality labeled datasets is a major bottleneck in
supervised machine learning workflows. Auto-labeling systems are a promising
way to reduce reliance on manual labeling for dataset construction.
Threshold-based auto-labeling, where validation data obtained from humans is
used to find a threshold for confidence above which the data is
machine-labeled, is emerging as a popular solution used widely in practice.
Given the long shelf-life and diverse usage of the resulting datasets,
understanding when the data obtained by such auto-labeling systems can be
relied on is crucial. In this work, we analyze threshold-based auto-labeling
systems and derive sample complexity bounds on the amount of human-labeled
validation data required for guaranteeing the quality of machine-labeled data.
Our results provide two insights. First, reasonable chunks of the unlabeled
data can be automatically and accurately labeled by seemingly bad models.
Second, a hidden downside of threshold-based auto-labeling systems is
potentially prohibitive validation data usage. Together, these insights
describe the promise and pitfalls of using such systems. We validate our
theoretical guarantees with simulations and study the efficacy of
threshold-based auto-labeling on real datasets.
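A minimal sketch of a threshold-based auto-labeling loop, assuming a trained model that emits a confidence per example and a small human-labeled validation set; the threshold rule below (lowest confidence cutoff whose validation accuracy clears a target) is a simple stand-in for the procedures analyzed in the paper, not taken from it.

import numpy as np

def pick_threshold(val_conf, val_correct, target_acc=0.95):
    """Return the lowest confidence cutoff whose validation accuracy >= target."""
    order = np.argsort(-val_conf)                 # sort high-confidence first
    conf, correct = val_conf[order], val_correct[order]
    acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    ok = np.where(acc >= target_acc)[0]
    return conf[ok[-1]] if len(ok) else None      # None: no usable threshold

# toy setup: confidences and whether the model's prediction was correct
rng = np.random.default_rng(0)
val_conf = rng.uniform(0.5, 1.0, size=500)
val_correct = (rng.uniform(size=500) < val_conf).astype(int)  # better when confident

t = pick_threshold(val_conf, val_correct)
unlabeled_conf = rng.uniform(0.5, 1.0, size=10000)
auto_labeled = unlabeled_conf >= t if t is not None else np.zeros(10000, bool)
print("threshold:", t, "fraction auto-labeled:", auto_labeled.mean())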
- …